Univariate Plots Section
Univariate Analysis
Bivariate Plots Section
Bivariate Analysis
Multivariate Plots Section
Multivariate Analysis
Final Plots and Summary
Reflection
References
I used to own an international chocolate brand. At times we marketed the product by pairing it with wine. Expert knowledge of wine, in addition to chocolate, would have helped us pair the products and refine the tasting experience for the customer. I anticipate that after this project I’ll have a better understanding of which features most affect the quality of wine.
In this project, I will use the R programming language and apply exploratory data analysis techniques to:
After loading packages, reading in the data file (‘wineQualityReds.csv’), and tidying the data set, we are ready for our analysis. I renamed column X to ‘wine.id’ and converted quality to an ordered factor called ‘rating’. As the counts in the summary below show, scores of 3 and 4 fall under ‘good’, 5 and 6 under ‘very good’, and 7 and 8 under ‘outstanding’; ‘not recommended’ and ‘mediocre’ cover scores below 3, and ‘classic’ covers 9 and 10. The rating labels are borrowed from Wine Spectator’s 100-point scale.
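A minimal sketch of the tidying step in base R. The ten rows are copied from the `str()` output in this report; the break points are an assumption, chosen so that the bins reproduce the reported counts (63 ‘good’, 1,319 ‘very good’, 217 ‘outstanding’), since the original code is not shown.

```r
# First ten rows (index and quality), copied from the str() output in this report.
red_head <- data.frame(
  X = 1:10,
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Rename the row-index column X to wine.id.
names(red_head)[names(red_head) == "X"] <- "wine.id"

# Bin quality into an ordered factor. The break points are an assumption.
# Intervals are right-closed: (2,4] -> good, (4,6] -> very good, (6,8] -> outstanding.
rating_levels <- c("not recommended", "mediocre", "good",
                   "very good", "outstanding", "classic")
red_head$rating <- cut(red_head$quality,
                       breaks = c(-1, 0, 2, 4, 6, 8, 10),
                       labels = rating_levels,
                       ordered_result = TRUE)

table(red_head$rating)
```

With these breaks the ten sample rows map to levels 4 and 5 (‘very good’ and ‘outstanding’), matching the level codes shown for `rating` in the `str()` output.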
While over 500 chemical compounds have been identified in wine, most produced naturally during fermentation, all wines have some basic elements in common including acid & sugar.
During fermentation, sugar is turned into alcohol when the skin of a ripe grape separates & the sugary juice on the inside makes contact with yeasts living naturally in the air & on the surface of the grape skin. Yeasts voraciously eat their way through the sugar and convert it into alcohol (leftover sugar makes for a sweeter wine).
Acid, when present in the right proportion, results in an intense & refreshing wine. Acid has the added benefit of acting as a preservative, while alcohol balances other flavors.
Other factors such as variety of grapes, optimum ripeness & yield, and soil quality contribute to the perfect wine, but ultimately we’re looking for an optimum balance of sugar and acidity.
In particular, I would like to analyze these three variables (acidity, sugar, alcohol) and their effect on the quality of red wine in our data set. During the data analysis, I anticipate I may come across other variables that could also affect the quality of wine.
Our data set is limited to red variants of the Portuguese “Vinho Verde” wine, so I’m not sure how accurate modeling wine quality will be relative to red wine in general. Is our sample size large enough to draw robust conclusions about quality and make accurate predictions?
Let’s explore the data set more broadly to understand its structure.
## [1] 1599 14
## 'data.frame': 1599 obs. of 14 variables:
## $ wine.id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ rating : Ord.factor w/ 6 levels "not recommended"<..: 4 4 4 4 4 4 4 5 5 4 ...
## wine.id fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating
## Min. : 8.40 Min. :3.000 not recommended: 0
## 1st Qu.: 9.50 1st Qu.:5.000 mediocre : 0
## Median :10.20 Median :6.000 good : 63
## Mean :10.42 Mean :5.636 very good :1319
## 3rd Qu.:11.10 3rd Qu.:6.000 outstanding : 217
## Max. :14.90 Max. :8.000 classic : 0
Let’s plot quality and rating to understand this distribution; after all, quality is our main focus in this analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## not recommended mediocre good very good
## 0 0 63 1319
## outstanding classic
## 217 0
The quality of the wines ranges from 3.00 to 8.00 with a median of 6.00. 82.5% (1,319) of the wines are rated “very good”, per our rating, which falls right in the middle of the distribution. The data set lacks the very worst and very best quality wines, which may affect our models. It will be interesting to examine the distributions of each variable and investigate any correlations between variables.
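The 82.5% figure can be checked directly from the rating counts. This is a small sketch using the counts printed in the summary output, not the original code.

```r
# Rating counts copied from the summary output in this report.
rating_counts <- c("not recommended" = 0, "mediocre" = 0, "good" = 63,
                   "very good" = 1319, "outstanding" = 217, "classic" = 0)

total <- sum(rating_counts)                      # total number of wines
share <- round(100 * rating_counts / total, 1)   # percentage per rating
share
```

On the full data frame (named `red` in the test output later in this report), `prop.table(table(red$rating))` would give the same proportions.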
Going forward, I will use ggplot syntax when plotting the distributions of the input variables, several of which have very long tails spanning an order of magnitude or more.
In the second plot for each variable, I’ll use a log10 transformation to pull in the long tail and bring the distribution closer to normal, so that patterns are easier to see. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed variables, but reducing skew can still help when modeling.
It is preferable to use scale_x_log10 rather than wrapping the variable in log10(), because the axis then shows the original units instead of log units.
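To illustrate what the log10 transformation does to a long right tail, here is a small base R sketch. The ten `residual.sugar` values are copied from the `str()` output above; the `skewness()` helper is written here for illustration and is not part of the original analysis.

```r
# First ten residual.sugar values, copied from the str() output in this report.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residual_sugar)          # skew on the original scale
skew_log <- skewness(log10(residual_sugar))   # skew after the log10 transform
c(raw = skew_raw, log10 = skew_log)
```

For these ten values the transform reduces the positive skew but does not remove it, which is worth keeping in mind when a transformed histogram still shows a tail. In ggplot2, adding `scale_x_log10()` to a histogram applies the same transform while keeping axis labels in the original units.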
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
fixed.acidity has a median of 7.90 and is positively skewed, with some outliers greater than 15.0. A log10 scaling layer makes the distribution more symmetric and pulls in the extreme outliers. Most of the acids involved with wine are fixed acids, with the notable exception of acetic acid (volatile.acidity). Presumably, volatile acids will have a greater effect on quality than fixed acids, as too high a concentration results in a vinegary taste.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
volatile.acidity is also positively skewed. Transforming the data presents a slightly bimodal distribution with peaks around 0.4 and 0.7. There are some outliers among the more acidic wines, with values greater than 1.0, but the majority are around 0.5 in the transformed distribution. Volatile acidity is mostly caused by bacteria in the wine, resulting in a vinegary taste. This is the first of three variables I suggest will affect wine quality most, so I will pay attention to its impact on quality as well as its correlation with other variables in the next section.
Let’s summarise to look closer at wine by acidity:
## # A tibble: 6 x 6
## quality mean_acid median_acid min_acid max_acid n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 0.8845000 0.845 0.44 1.580 10
## 2 4 0.6939623 0.670 0.23 1.130 53
## 3 5 0.5770411 0.580 0.18 1.330 681
## 4 6 0.4974843 0.490 0.16 1.040 638
## 5 7 0.4039196 0.370 0.12 0.915 199
## 6 8 0.4233333 0.370 0.26 0.850 18
The mean and median volatile acidity levels generally decrease as wine quality increases (the mean ticks up slightly between qualities 7 and 8), indicating a negative correlation with wine quality. We’ll confirm this hunch in the next section.
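A per-quality summary like the one above can be sketched in base R as follows. The original output is a tibble, so the author most likely used dplyr; this version avoids that dependency and runs on the first ten rows shown in the `str()` output, so its numbers differ from the full table.

```r
# First ten rows, copied from the str() output in this report.
red_head <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# Split by quality, summarise each group, and stack the results.
by_quality <- do.call(rbind, lapply(split(red_head, red_head$quality), function(g) {
  data.frame(
    quality     = g$quality[1],
    mean_acid   = mean(g$volatile.acidity),
    median_acid = median(g$volatile.acidity),
    min_acid    = min(g$volatile.acidity),
    max_acid    = max(g$volatile.acidity),
    n           = nrow(g)
  )
}))

by_quality
```

The column names mirror those in the table above; with dplyr the same result would typically come from `group_by()` followed by `summarise()`.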
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
citric.acid is an interesting distribution as it changes from long-tail right to long-tail left after transforming the data. The majority of wines have small amounts of citric.acid, with lots having none at all. In our initial observations, it is the only variable with a minimum value of 0.00; perhaps indicating missing data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
residual.sugar is very much positively skewed with extreme outliers far away from the median. Sugar content ranges widely from 0.90 to 15.50. There are a lot of wines with low sugar content between 1.0 - 3.0 (no sweet wines in our data set). Sugar is the second of three variables I predict will affect wine quality most.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
chlorides is similar to residual.sugar, having high concentrations around the median. The majority of wines have low chloride levels, roughly between 0.01 - 0.2. Transforming the data makes the distribution look more normal and shows more clearly that most wines have chloride values between 0.05 - 0.10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
free.sulfur.dioxide is positively skewed with extreme outliers greater than 50. High concentrations can affect the smell and taste of wine. Transforming the data eliminates some outliers and returns a bimodal distribution, with peaks around 7.0 and 11.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
total.sulfur.dioxide is positively skewed, also with extreme outliers greater than 280. A few wines have concentrations around 280, while at 30 there are over 300 wines. Upon transformation the distribution looks more normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
density looks to have a very normal distribution, with & without transformation, with little-to-no outliers. Density is affected by alcohol and sugar so we’ll examine further in the next section.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH also has a very normal distribution, with most wines ranging between 3.0 - 3.5, with outliers around 2.7 and 4.0. Most wines are acidic and range between 3.0 - 4.0 on the pH scale. It is possible pH has a greater than expected impact on wine quality so we’ll examine further in the next section.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
sulphates is positively skewed with a long-tail distribution. Transforming the data shows a fairly normal distribution, similar to density and pH. Most wines contain sulphate levels between 0.3 - 1.4, as our distribution shows. There are some outliers around the 2.0 level.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
alcohol is positively skewed, and the distribution keeps the same shape after transforming the data. Alcohol content is most commonly between 9.5 - 10.0 (the median is 10.2), which is interesting as red wine typically contains 11.5% - 13.5%. Alcohol is the third of the three variables I suggested will have a greater effect on wine quality.
Let’s summarise across quality:
## # A tibble: 6 x 6
## quality mean_alc median_alc min_alc max_alc n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 8.4 11.0 10
## 2 4 10.265094 10.000 9.0 13.1 53
## 3 5 9.899706 9.700 8.5 14.9 681
## 4 6 10.629519 10.500 8.4 14.0 638
## 5 7 11.465913 11.500 9.2 14.0 199
## 6 8 12.094444 12.150 9.8 14.0 18
The mean and median alcohol content generally increase with higher rated wines (quality 5 is a dip), indicating a positive correlation.
There are 1,599 wines in the data set with 11 numeric input variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol). Quality, an integer, is our categorical variable and rating is an ordered factor.
Most of the distributions are positively skewed with some variables having extreme outliers (residual.sugar, chlorides, sulphates, free.sulfur.dioxide, total.sulfur.dioxide). Two distributions were normal (density, pH). Citric.acid is an interesting distribution and the only variable with a minimum of 0.00. 82.5% of the wines are rated ‘very good’ with quality scores of 5 and 6.
Quality is the main feature of interest in the data set. How certain variables affect quality is the objective of this analysis, ultimately leading us to build an accurate predictive model.
Initially I predicted acidity, sugar and alcohol would have the greatest impact on the quality of wine. However, after plotting and analyzing the univariate distributions it appears density and pH, with their very normal distributions, may have a greater impact on quality.
I created an ordered factor of rating to compare against quality. I considered creating a new variable ‘total.acidity’, composed of fixed and volatile acidity, but left it alone. Upon further investigation I learned volatile.acidity is the variable that will most affect the quality of wine (vinegary taste).
Citric.acid was an unusual distribution with a min of 0.00, perhaps due to missing data. Also, quality is unusual in that we only have wines ranked 3 - 8. Where are wines 1-2 and 9-10? Again, perhaps a limited data set. Without lowest & highest quality wines to consider will our analysis be clear cut?
For optics and potential future analysis, I did tidy the data by renaming column ‘X’ to ‘wine.id’ and creating a factored variable called ‘rating’ from quality scores.
Let’s first create a correlation matrix of the variables in our data set. A correlation matrix is used to investigate the dependence between multiple variables at the same time. The resulting table contains the correlation coefficients between each variable and the others.
Correlation: Negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa. A perfect negative correlation is represented by the value -1.00 (x increases, y decreases or vice versa), while a 0.00 indicates no correlation, and a +1.00 indicates a perfect positive correlation (x & y increase/decrease in tandem).
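As a rough illustration of how such a matrix is produced, the sketch below calls base R’s `cor()` on a few columns of the first ten rows from the `str()` output. The choice of columns is arbitrary, and with only ten rows the coefficients will not match those of the full data set.

```r
# First ten rows of four columns, copied from the str() output in this report.
red_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(red_head, method = "pearson")
round(cor_matrix, 2)
```

The matrix is symmetric with 1.00 on the diagonal, since every variable is perfectly correlated with itself; the `quality` row (or column) is the one to read when looking for candidate predictors.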
For the following analysis, let’s utilize boxplots to analyze each variable more closely against the above correlation matrix.
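The numbers behind a boxplot can be inspected without drawing anything. The sketch below uses base R’s `boxplot()` with `plot = FALSE` on the first ten rows from the `str()` output; the report’s own plots were drawn with ggplot, so this is an illustration of the statistics rather than the original plotting code.

```r
# First ten rows, copied from the str() output in this report.
red_head <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# plot = FALSE returns the box statistics instead of drawing the plot.
bp <- boxplot(volatile.acidity ~ quality, data = red_head, plot = FALSE)

bp$names   # one group per quality score present in the data
bp$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$n       # number of observations per group
```

Each column of `bp$stats` corresponds to one quality score, which makes it easy to compare medians and hinges across groups alongside the per-quality summaries in this section.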
##
## Pearson's product-moment correlation
##
## data: red$quality and red$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$fixed.acidity)
## t = 4.5953, df = 1597, p-value = 4.661e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06558369 0.16234953
## sample estimates:
## cor
## 0.1142376
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
Better rated wines tend to have slightly higher mean & median values of fixed.acidity, but the values are similar across quality levels and do not rise steadily. The correlation (0.12) is statistically significant (p < 0.001) yet weak, so fixed.acidity doesn’t seem to have much effect on wine quality.
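The test output in this section follows a repeated pattern: a Pearson test on the raw variable, the same test on its log10 transform, and a per-quality summary. A minimal sketch of that pattern is below, run on the first ten rows from the `str()` output (so the estimates differ from the full-data results); the use of `by()` is inferred from the layout of the printed output.

```r
# First ten rows, copied from the str() output in this report.
red_head <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Pearson correlation test on the raw and the log10-transformed variable.
ct_raw <- cor.test(red_head$quality, red_head$fixed.acidity)
ct_log <- cor.test(red_head$quality, log10(red_head$fixed.acidity))

# Five-number summary plus mean for each quality score.
by(red_head$fixed.acidity, red_head$quality, summary)

# Significance is not strength: the full-data estimate reported above (0.1240516)
# corresponds to roughly 1.5% of shared variance.
r_full <- 0.1240516
r_full^2
```

With 1,599 observations even a small correlation yields a tiny p-value, so the squared coefficient is a useful check on how much the variable actually explains.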
##
## Pearson's product-moment correlation
##
## data: red$quality and red$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$volatile.acidity)
## t = -16.99, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4319851 -0.3489201
## sample estimates:
## cor
## -0.3912492
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
High volatile.acidity negatively impacts the quality of wine (-0.39). Both the median and mean levels decrease as wine quality increases. Outliers are greatest in lower quality wines and the interquartile range decreases as quality improves.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$citric.acid)
## t = NaN, df = 1597, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## NaN NaN
## sample estimates:
## cor
## NaN
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Most wines have small quantities of citric.acid. Higher quality wines appear to contain slightly more citric.acid, indicating a slightly positive correlation (0.23), peaking at a level of around 0.50. The log10 correlation is NaN because log10(0) is -Inf for the wines with no citric.acid. Interestingly, wines with 0.00 citric.acid are found at every quality level except the highest quality of 8 in our data set.
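The NaN in the log10 test for citric.acid can be reproduced on the first ten rows from the `str()` output. The workaround shown (restricting to non-zero values) is only one option and is an assumption here, not something the original analysis did; it changes the sample, so the result answers a slightly different question.

```r
# First ten rows, copied from the str() output in this report.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
citric  <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

log10(0)   # -Inf, so any wine with zero citric acid becomes -Inf

# The infinite values propagate through the means and sums inside cor().
r_log <- suppressWarnings(cor(quality, log10(citric)))
r_log

# One possible workaround: correlate only over wines with non-zero citric acid.
keep <- citric > 0
r_nonzero <- cor(quality[keep], log10(citric[keep]))
r_nonzero
```

Another common choice is to add a small constant before taking the log, but the size of that constant is arbitrary and affects the result.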
##
## Pearson's product-moment correlation
##
## data: red$quality and red$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$residual.sugar)
## t = 0.94071, df = 1597, p-value = 0.347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02551727 0.07247084
## sample estimates:
## cor
## 0.02353331
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
residual.sugar has virtually no effect on the quality of wine (0.01, p = 0.58); in fact it is the best instance in our data set of having no correlation at all. The mean, median & interquartile range all hover around the same values. Lower rated wines have the most outliers, perhaps due to a limited data set. The log10 plot shows much of the same.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$chlorides)
## t = -7.1508, df = 1597, p-value = 1.308e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2232336 -0.1282260
## sample estimates:
## cor
## -0.17614
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
Most wines have very low quantities of chlorides. There is a weak negative correlation (-0.13) between chlorides and quality. Outliers are found in lower quality wines. The log10 plot indicates lower amounts of chlorides are found in higher quality wine. The interquartile range is smallest for 8-quality wines. The correlation strengthens by about 0.05 (to -0.18) with a log10 transformation.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$free.sulfur.dioxide)
## t = -2.0041, df = 1597, p-value = 0.04522
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.098865884 -0.001068979
## sample estimates:
## cor
## -0.05008749
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
This is an interesting plot. The correlation is close to 0.00. Wines of lower and higher quality have less free.sulfur.dioxide and similar means and medians. Average wines have slightly higher levels of free.sulfur.dioxide. The log10 plot tells the same story.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$total.sulfur.dioxide)
## t = -6.8999, df = 1597, p-value = 7.476e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2173510 -0.1221403
## sample estimates:
## cor
## -0.1701427
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
There is only a weak negative correlation (-0.19) between total.sulfur.dioxide and wine quality; the means, medians, and interquartile ranges of the best and worst wines are similar, with 5-quality wines somewhat higher. Interestingly, the most extreme outlier (289) is a 7-quality wine.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$density)
## t = -7.1103, df = 1597, p-value = 1.74e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2222860 -0.1272452
## sample estimates:
## cor
## -0.1751737
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
Higher quality wines appear to have lower density. The best wine has the lowest median. However the upper quartile of the best wine is within the other medians. Given the plot and correlation, the relationship is not very strong. It would be interesting to look at the relation between density and alcohol as higher alcohol content may affect both density and quality.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$pH)
## t = -2.3046, df = 1597, p-value = 0.02132
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106294995 -0.008576923
## sample estimates:
## cor
## -0.05757386
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
It looks as though there is a stronger relationship between pH and quality than the trend line or correlation suggests. The mean and median of pH decrease slightly as the quality of wine increases, but the dispersion of data points in better wines is still quite large. The correlation between pH and quality is close to zero (-0.06), but perhaps in relation to other variables (SO2, acidity) we will gain some meaningful insights.
##
## Pearson's product-moment correlation
##
## data: red$quality and red$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$sulphates)
## t = 12.967, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2636092 0.3523323
## sample estimates:
## cor
## 0.3086419
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
There is a relationship between better wines and more sulphates, albeit a slight one, especially if we disregard extreme outliers. Better wines appear to have higher concentrations of sulphates. The interquartile range for the best quality wines is rather small. The correlation between the two variables strengthens by about 0.06 (from 0.25 to 0.31) after a log10 transformation.
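The figures in the cor.test() printout above are linked by standard formulas, and a short sketch can make that link concrete. This is an illustration rather than part of the original analysis: it recomputes the t statistic and the 95% confidence interval from the printed estimate (0.2513971) and degrees of freedom (1597), using the usual t formula and Fisher's z transformation. The variable names are my own.

```r
# Values copied from the printed cor.test() output for quality vs. sulphates.
r  <- 0.2513971   # sample Pearson correlation
df <- 1597        # degrees of freedom (n - 2)
n  <- df + 2      # number of wines

# t statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z transformation.
z     <- atanh(r)
se    <- 1 / sqrt(n - 3)
ci_95 <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 2))
print(round(ci_95, 4))
```

Both results should agree with the printout (t = 10.38 and an interval of roughly 0.205 to 0.297). Swapping in r = 0.3086419 gives the corresponding check for the log10-transformed test.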
##
## Pearson's product-moment correlation
##
## data: red$quality and red$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: red$quality and log10(red$alcohol)
## t = 21.687, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4382062 0.5139842
## sample estimates:
## cor
## 0.4769811
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Higher quality wines appear to contain more alcohol. At 0.48, this is the strongest positive correlation with quality in our data set, although it is only moderate in absolute terms. The mean alcohol content is generally greater in higher quality wines. The trend is clear enough that the lower quartile for quality-8 wines (11.32) is greater than the upper quartile for quality-6 wines (11.30). With many outliers, it is likely that alcohol alone doesn’t determine wine quality.
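The quartile comparison can be checked directly from the printed summaries. The sketch below copies the first and third quartiles of alcohol for each quality level into a small data frame (the object names are my own) and tests the claim about quality-8 and quality-6 wines.

```r
# First and third quartiles of alcohol by quality, copied from the summaries above.
alcohol_q <- data.frame(
  quality = 3:8,
  q1 = c(9.725, 9.60, 9.4, 9.80, 10.80, 11.32),
  q3 = c(10.575, 11.00, 10.2, 11.30, 12.10, 12.88)
)

q1_of_8 <- alcohol_q$q1[alcohol_q$quality == 8]
q3_of_6 <- alcohol_q$q3[alcohol_q$quality == 6]

# TRUE when the lower quartile of quality-8 wines exceeds
# the upper quartile of quality-6 wines.
print(q1_of_8 > q3_of_6)
print(q1_of_8 - q3_of_6)
```

The claim holds, though the margin is small (0.02 percentage points of alcohol), so it is better read as "the boxes barely touch" than as a wide gap.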
Alternative plot from revision:
library(ggplot2)  # assumes the data frame 'red' is already loaded
ggplot(data = red, aes(x = factor(quality), y = alcohol)) +
  geom_jitter(alpha = 1/10) +
  geom_boxplot(alpha = 1/10, color = 'blue') +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = 'point', color = 'red') +
  labs(x = 'Quality (score between 3 and 8)',
       y = 'Alcohol (% by volume)',
       title = 'Boxplot of alcohol across qualities')
Based on the bivariate analysis of plots and correlations, I now believe the following variables will best predict wine quality:
No single variable was strongly correlated with wine quality. Surprisingly, sugar had the least effect on wine quality (0.01), despite my initial prediction that it would be one of the three most influential variables.
Let’s zoom in and take a closer look:
Wines rated ‘very good’ (quality 6 and 7) span the full spectrum of sugar content, although some lesser ‘good’ wines (quality 4 and 5) also have higher sugar content. It doesn’t appear that sugar content greatly affects wine quality, but we’ll analyze further in the next section.
Fortunately, two of the three show a moderate correlation with wine quality and could possibly be used to develop a predictive model. Alcohol (0.48) and volatile.acidity (-0.39) should work well together, as they are less correlated with each other (-0.2) than either is with quality.
Taken alone, density, pH, citric.acid, and fixed.acidity have weak to zero correlation with quality. However, they are more strongly correlated with certain other variables, which may help inform wine quality:
* alcohol : density (-0.50)
* density : fixed.acidity (0.67)
* pH : fixed.acidity (-0.68)
* pH : citric.acid (-0.54)
Volatile.acidity (acetic acid), which is produced during fermentation, had a moderately negative correlation with wine quality.
Another variable that interested me was citric.acid. It has relatively high (negative) correlation with both volatile.acidity and pH, and exhibits a somewhat high correlation, relatively, with density. Previously, I did not have a grasp on how acidity affects wine, let alone the chemical components and interactions with other variables.
Relative to quality, alcohol had the strongest positive correlation (0.48).
Among the other variables, fixed.acidity and pH had the strongest negative correlation (-0.68), while fixed.acidity paired with density and with citric.acid had the strongest positive correlations (both 0.67).
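Pairwise figures like these can be pulled out of a correlation matrix programmatically. The following is a minimal sketch, not the code used in this report: `strongest_pairs()` is a hypothetical helper, and the `toy` data frame is a made-up stand-in (not the wine data) in which pH is constructed as an exact linear function of fixed.acidity so the expected top pair is known in advance.

```r
# Return the n most strongly correlated variable pairs in a numeric data frame.
strongest_pairs <- function(df, n = 3) {
  m   <- cor(df)
  idx <- which(upper.tri(m), arr.ind = TRUE)   # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]       # strongest first, either sign
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Toy stand-in data (NOT the wine data): pH is an exact linear function
# of fixed.acidity, so that pair must rank first with r = -1.
toy <- data.frame(
  fixed.acidity = c(6.0, 7.4, 7.8, 8.5, 9.9, 11.2),
  alcohol       = c(9.4, 12.1, 9.8, 10.5, 9.5, 11.0)
)
toy$pH <- 4.5 - 0.15 * toy$fixed.acidity

top <- strongest_pairs(toy, n = 3)
print(top)
```

On the real data, the same helper could be applied to the numeric columns of `red`, assuming the data frame has been loaded as described at the start of the report.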
As our analysis has evolved, so too has my understanding of what makes for better quality red wine. The variables I initially predicted would affect wine quality were acidity, sugar, and alcohol.
We will continue analyzing acids and alcohol, as they are major wine constituents and contribute greatly to its taste. Since we are working with a red wine data set, and not sweet wines, residual.sugar is less compelling and we will not consider it for our model. However, two other important variables need a closer look: pH and SO2.
pH plays a role in the stability of wine, while free.sulfur.dioxide is an effective preservative. Understanding the relationship between pH and sulfur dioxide (SO2) is critical. The higher the pH, the less SO2 will be in the useful free form and the less effective this free SO2 will be.
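To make the pH dependence concrete, here is a rough sketch of the textbook acid-base (Henderson-Hasselbalch) relation for sulfur dioxide. It assumes a first dissociation constant (pKa) of about 1.81, a value commonly quoted for wine at room temperature. That constant and the function name are assumptions of this sketch, not values taken from the data set.

```r
# Fraction of free SO2 present in the molecular form at a given pH.
# pKa = 1.81 is an assumed, commonly quoted value; adjust as needed.
molecular_so2_fraction <- function(pH, pKa = 1.81) {
  1 / (1 + 10^(pH - pKa))
}

ph_values <- c(3.0, 3.3, 3.6)
fractions <- molecular_so2_fraction(ph_values)

# Percent of free SO2 in molecular form at each pH.
print(round(100 * fractions, 1))
```

Under this assumption the molecular share falls steadily as pH rises across the range seen in these wines, which is the sense in which a higher pH makes a given amount of free SO2 less effective.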
In this section, I’ll explore many variables at once, examine relationships among our main variables (pH, free.sulfur.dioxide, volatile.acidity, alcohol), and look for patterns predicting wine quality.
alcohol and volatile.acidity are the two most correlated variables with quality in our data set, as well as two of our main variables. The plot confirms that in better quality wines there is less volatile.acidity and more alcohol. Most wines with volatile.acidity > 0.60 are lower quality and have less alcohol content, < 10.0.
density and volatile.acidity are two of the least correlated variables with each other. Wine quality appears to be higher when volatile.acidity and density are low.
Regardless of density, most of the higher quality wines contain > 11% alcohol.
Alternative plots from revision:
ggplot(data = red,
       aes(x = density,
           y = alcohol,
           # quality must be discrete for the Brewer palette and per-level trend lines
           color = factor(quality))) +
  coord_cartesian(xlim = c(0.985, 1.002),
                  ylim = c(7.5, 15)) +
  geom_jitter(size = 1) +
  geom_smooth(method = 'lm') +
  scale_x_continuous(breaks = seq(0.985, 1.002, 0.002)) +
  scale_color_brewer(type = 'seq', guide = guide_legend(title = 'Quality levels')) +
  theme_dark()
ggplot(data = red,
       aes(x = density, y = alcohol, color = factor(quality))) +
  coord_cartesian(xlim = c(0.985, 1.005),
                  ylim = c(5, 15)) +
  geom_jitter() +
  scale_color_brewer(type = 'seq') +
  theme_dark() +
  # density in this data set is recorded in g/cm^3 (values close to 1)
  labs(x = 'Density (g/cm^3)',
       y = 'Alcohol (% by volume)',
       title = 'Relationship of density vs. alcohol with colored quality levels')
alcohol and sulphates have a low correlation with each other, but taken together present an interesting pattern. Wines with higher alcohol content typically had lower SO2 content than wines with lower alcohol content. Higher quality wines have more alcohol (> 10.0) and fewer sulphates (< 0.80).
sulphates and citric.acid are the next most correlated variables with quality (after alcohol & volatile.acidity). In this plot, wine appears to improve in quality with a greater concentration of sulphates, regardless of citric acid. The outliers tend to be lesser quality wines.
pH has a weak correlation with quality, and citric.acid is one of the variables most correlated with pH. Better quality wine tends to show higher pH and more citric acid.
Most measurements are for free.sulfur.dioxide (the active form), as the majority of bound SO2 (part of total SO2) is of no use as a preservative. Low pH (more acid) requires a higher % of SO2 for quality wine.
From the analysis in this section we can see certain patterns emerge against a lot of noisy plots. Let’s examine the three most correlated variables to quality, but eliminate the mid-range (quality 5 & 6) wines and see if the visualization is clearer &/or shows us anything different.
Most high quality wines tend to contain low volatile.acidity, low sulphates, and high alcohol. There is still a lot of noise and overlap, indicating no one variable can entirely predict good wine quality.
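Dropping the mid-range wines is a one-line filter. The sketch below rebuilds only the quality column from the frequency table printed in this report (10, 53, 681, 638, 199 and 18 wines rated 3 to 8), so it runs without the CSV file. The object names `wines` and `extremes` are my own.

```r
# Rebuild the quality column from the printed frequency table.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))
wines   <- data.frame(quality = quality)

# Keep only the low (3, 4) and high (7, 8) quality wines.
extremes <- subset(wines, !quality %in% c(5, 6))

print(nrow(extremes))
print(table(extremes$quality))
```

Only 280 of the 1,599 wines remain after the filter, which is why the resulting plot is far less cluttered but also rests on a much smaller sample.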
A linear regression model is created to predict quality based on the physicochemical properties of our red wine data set. I foresee a couple of shortcomings negatively impacting our ability to predict wine quality with a high degree of confidence:
* 1) the data set is biased, as the wine is limited to red variants of the Portuguese “Vinho Verde” type
* 2) based on our rating, there are neither ‘classic’ (10) nor ‘not recommended’ (0,1) wines (the vast majority are of average quality)
I ran seven linear regression models (lm1 through lm7), ultimately submitting a model containing the variables analyzed in the multivariate plot section. No single model was far off from the others, nor, more importantly, did any of the predictions of quality inspire significant confidence. Perhaps a more complete data set would improve the model’s predictive power.
##
## Calls:
## lm1: lm(formula = as.numeric(quality) ~ (alcohol), data = red)
## lm2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity,
## data = red)
## lm3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## pH, data = red)
## lm4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## pH + free.sulfur.dioxide, data = red)
## lm5: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## pH + free.sulfur.dioxide + density, data = red)
## lm6: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## pH + free.sulfur.dioxide + density + sulphates, data = red)
## lm7: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## pH + free.sulfur.dioxide + density + sulphates + citric.acid,
## data = red)
##
## =========================================================================================================================
##                       lm1         lm2         lm3         lm4         lm5         lm6         lm7
## -------------------------------------------------------------------------------------------------------------------------
## (Intercept)           -0.125      1.095***    2.269***    2.275***    -9.743      1.486       -13.084
##                       (0.175)     (0.184)     (0.369)     (0.369)     (10.762)    (10.791)    (11.928)
## alcohol               0.361***    0.314***    0.330***    0.329***    0.338***    0.319***    0.339***
##                       (0.017)     (0.016)     (0.017)     (0.017)     (0.019)     (0.019)     (0.020)
## volatile.acidity                  -1.384***   -1.279***   -1.283***   -1.282***   -1.161***   -1.329***
##                                   (0.095)     (0.099)     (0.099)     (0.099)     (0.100)     (0.116)
## pH                                            -0.422***   -0.412***   -0.377**    -0.289*     -0.465***
##                                               (0.115)     (0.116)     (0.120)     (0.119)     (0.134)
## free.sulfur.dioxide                                       -0.001      -0.001      -0.002      -0.002
##                                                           (0.002)     (0.002)     (0.002)     (0.002)
## density                                                               11.839      0.004       15.173
##                                                                       (10.596)    (10.646)    (11.891)
## sulphates                                                                         0.645***    0.668***
##                                                                                   (0.104)     (0.104)
## citric.acid                                                                                   -0.382**
##                                                                                               (0.135)
## -------------------------------------------------------------------------------------------------------------------------
## R-squared             0.227       0.317       0.323       0.323       0.324       0.340       0.343
## adj. R-squared        0.226       0.316       0.321       0.321       0.321       0.337       0.340
## sigma                 0.710       0.668       0.665       0.665       0.665       0.658       0.656
## F                     468.267     370.379     253.328     190.155     152.397     136.411     118.592
## p                     0.000       0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood        -1721.057   -1621.814   -1615.101   -1614.724   -1614.097   -1594.978   -1590.942
## Deviance              805.870     711.796     705.845     705.512     704.959     688.301     684.835
## AIC                   3448.114    3251.628    3240.202    3241.447    3242.195    3205.956    3199.884
## BIC                   3464.245    3273.136    3267.087    3273.710    3279.835    3248.973    3248.279
## N                     1599        1599        1599        1599        1599        1599        1599
## =========================================================================================================================
Given our data set, the r-squared statistic, a measure of how well the model fits the actual data, is low at 34.3%. R-squared measures the proportion of variance in our response / target variable (quality) that is explained by the predictor variables in the model. In a multiple regression setting, r-squared never decreases as more variables are included in the model. Therefore, adjusted r-squared is the preferred measure, as it adjusts for the number of variables considered. Our adjusted r-squared value is 34.0%. Surprisingly, alcohol alone explains only about 23% of the variance in wine quality (r-squared of 0.227 for lm1).
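The adjustment can be verified by hand from the table. The sketch below recomputes the adjusted r-squared, AIC and BIC for lm7 using only printed values (R-squared 0.343, log-likelihood -1590.942, N 1599) and the standard formulas. The variable names are my own, and small differences in the last digit are expected because the inputs are rounded.

```r
n <- 1599                      # wines in the data set

# Adjusted R-squared for lm7 (7 predictors), from its printed R-squared.
r2 <- 0.343
p  <- 7
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# AIC and BIC for lm7 from its printed log-likelihood.
# k counts the intercept, the 7 slopes and the residual variance.
log_lik <- -1590.942
k   <- p + 2
aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k

print(round(adj_r2, 3))
print(round(c(AIC = aic, BIC = bic), 3))
```

Note that BIC penalizes each extra parameter by log(1599), about 7.4, instead of 2, which is why lm4 and lm5 look worse than lm3 on BIC even though their r-squared is no lower.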
The relationship between more alcohol and less volatile.acidity in better quality wine was clearly expressed. These were the two highest correlated variables with quality.
Additionally, density & volatile.acidity, sulphates & citric.acid, alcohol & density, and pH & free.sulfur.dioxide all displayed patterns for higher quality wine.
Introducing pH and free.sulfur.dioxide into the equation strengthened the patterns affecting wine quality. pH was an important parameter to understand, measuring acidity (acetic) and affecting taste (vinegary). Understanding the relationship between pH and SO2 was also crucial in our multivariate analysis.
Wines with higher alcohol content contained lower SO2 content. Higher quality wines had alcohol levels > 10.0 and sulphate levels < 0.80. It was interesting seeing this visualization.
I created a linear regression model with the variables analyzed in the multivariate plot section: alcohol, volatile.acidity, pH, free.sulfur.dioxide, density, sulphates, citric.acid. The model explained only 34.0% of the variance in quality (adjusted r-squared), for the reasons discussed above.
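For readers who want to reproduce this kind of nested comparison, here is a minimal base R sketch. It is not the code behind the table above: the data are synthetic (generated with arbitrary coefficients, not the wine data), only two models are fitted, and `fit_stats()` is a hypothetical helper.

```r
set.seed(42)

# Synthetic stand-in data (NOT the wine data); coefficients are arbitrary.
n <- 200
toy <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.1, 1.2)
)
toy$quality <- 2 + 0.35 * toy$alcohol - 1.3 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Two nested models, mirroring the structure of lm1 and lm2.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Collect the fit statistics of interest for one model.
fit_stats <- function(model) {
  s <- summary(model)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, AIC = AIC(model))
}

comparison <- rbind(m1 = fit_stats(m1), m2 = fit_stats(m2))
print(round(comparison, 3))
```

With the real data, the same pattern would apply to `red`, adding one predictor at a time and watching whether adjusted r-squared and AIC improve enough to justify the extra term.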
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The quality of wines in our data set ranges from 3 to 8. 82.5% (1,319) of the wines are rated 5 or 6. Only 1.8% (28) of wines are rated 3 or 8 combined. Not a single wine is ‘low’ quality at 1 or 2, nor ‘high’ quality at 9 or 10. As we determined throughout the analysis, the red wine data set is limited in comprehensive quality ratings and biased to a specific region and type of grape.
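These shares follow directly from the frequency table. A short sketch of the arithmetic, using only the printed counts (the variable names are my own):

```r
# Frequency table of quality ratings, as printed above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
total  <- sum(counts)

# Share of mid-range (5, 6) and extreme (3, 8) ratings, in percent.
mid_share     <- 100 * sum(counts[c("5", "6")]) / total
extreme_share <- 100 * sum(counts[c("3", "8")]) / total

print(total)
print(round(c(mid = mid_share, extreme = extreme_share), 1))
```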
Volatile.acidity summary:
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
Sulphates summary:
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
Taking the three variables most highly correlated with red wine quality, I created histograms and boxplots to present slightly different visualizations of each than those previously created:
* Alcohol: The distribution shown has a positive skew. As such, the mean is larger than the median. We know that, in general, the higher the alcohol content, the better the wine. Nevertheless, it would have been better to have more observations of higher &/or lower alcohol content to better inform our linear regression model.
* Volatile.acidity: The second most influential variable in our data set related to wine quality, albeit weakly to moderately correlated. The highest quality wines bottom out in their acetic acid content, indicating a support level for the best wines. From the summary we see that the medians of quality 7 and 8 wines are exactly the same (0.37) and their inter-quartile ranges differ by only 0.048 (0.185 vs. 0.1375). Furthermore, the inter-quartile ranges of adjacent quality levels all overlap, indicating volatile.acidity alone doesn’t determine quality.
* Sulphates: This plot illustrates the varying concentration of sulphates in our wines. As previously noted, this is an inverse relation to alcohol in higher quality wines; i.e. higher quality wines have more alcohol (> 10.0) and fewer sulphates (< 0.80). The shaded region corresponds with wines rated 5 or 6 quality.
Simultaneous visualization of the three most influential variables related to quality. The majority of wines, rated 5 and 6, have been eliminated to avoid excess clutter.
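The volatile.acidity figures can be checked from the printed summary. The sketch below copies the quartiles into a data frame (the names are my own), computes each inter-quartile range, and tests whether the boxes of neighbouring quality levels overlap.

```r
# Quartiles of volatile.acidity by quality, copied from the summary above.
va <- data.frame(
  quality = 3:8,
  q1 = c(0.6475, 0.530, 0.460, 0.380, 0.300, 0.3350),
  q3 = c(1.0100, 0.870, 0.670, 0.600, 0.485, 0.4725)
)
va$iqr <- va$q3 - va$q1

# Difference between the IQRs of quality-7 and quality-8 wines.
iqr_gap <- va$iqr[va$quality == 7] - va$iqr[va$quality == 8]

# Do the boxes of neighbouring quality levels overlap?
i <- seq_len(nrow(va) - 1)
adjacent_overlap <- va$q1[i] <= va$q3[i + 1] & va$q1[i + 1] <= va$q3[i]

print(va)
print(iqr_gap)
print(adjacent_overlap)
```

Every pair of neighbouring levels overlaps, but more distant levels do not always: the lower quartile for quality 3 (0.6475) sits above the upper quartile for quality 6 (0.60).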
Most high quality wines tend to contain low volatile.acidity, low sulphates, and high alcohol; while the exact opposite occurs in low quality wines. There is still a lot of noise and overlap of the colors and sizes, indicating no one combination, of even these three most-correlated variables, can entirely predict good wine quality. There are a few high quality wines with lower alcohol and sulphate concentrations, however none that are heavy on volatile.acidity.
Having completed the analysis of the red wine data set, and looking back at my initial goal of understanding red wine quality with respect to pairing it with chocolate, I believe I have a much stronger understanding of the winemaking process as a whole, and more specifically which chemical properties potentially determine wine quality. I say ‘potentially’ because we ultimately found that there isn’t a single variable, nor set of variables, that could predict wine quality with high confidence. Bearing in mind winemaking is a science and quality is objective, some interesting challenges arose.
The data set was flawed from the beginning. The sample was limited (too small, lack of extremes) and biased (grape types from a single area in Portugal). Too many wines of average quality created a lot of noise, which isn’t necessarily helpful for building a linear model to predict wine quality. As the authors’ notes indicate, the way the quality rating is calculated (the median of subjective evaluations) could help explain why most of the wines are of quality five or six.
In the univariate plot section, the objective was to understand the distribution of each variable and check for anomalies and outliers. Using appropriate plots, we learned how to quantify and visualize individual variables within a data set.
In the bivariate plot section, we explored variables to identify the most important relationships and patterns within our data set; calculating correlations and investigating conditional means.
Lastly, in the multivariate plot section, we utilized powerful methods and visualizations for examining relationships among multiple variables; such as reshaping data frames and using aesthetics like color and shape to uncover more information.
Despite the challenges, walking through these three sections of analysis provided some clarity in terms of trends and influential variables related to wine quality.
We saw that variables were generally positively skewed or normally distributed. Alcohol and volatile.acidity were the two most correlated variables with quality, the former positively (> alcohol = > quality) and the latter negatively (< volatile.acidity = > quality). The case of the missing citric.acid values was eventually solved by reading additional wine resources, which explain that it is added to some wines to increase acidity, but not all; hence the 0.00 measures. Initially, I believed sugar would be a key component for measuring wine quality. It was not, but I quickly learned about fermentation and the properties of sweet vs. dry wine. I would have thought sulphates were a negative predictor of quality, but as we learned they are used as a preservative, and the plot told us they are present, but not to the detriment of good quality. Also surprising, pH had little to no correlation with quality, although we later learned of the crucial relationship between pH & SO2.
At the end of the day, and because winemaking is a science and quality is objective, we are ultimately searching for the right balance. The winemaker focuses on the chemical properties and the consumer can ‘subjectively’ choose his or her quality wine along measures of sweetness, acidity, tannin, alcohol and body.
For future analysis, the following would be helpful additions to our red wine data set:
Udacity - https://www.udacity.com/
Udacity Discussion Forum - https://discussions.udacity.com/c/nd002-data-analysis-with-r
Clarke, Oz, Introducing Wine: A Complete Guide for the Modern Wine Drinker,
Published November 1st 2004 by Harvest Books (first published 2000).
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at (@Elsevier): http://dx.doi.org/10.1016/j.dss.2009.05.016
Pre-press (pdf): http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
bib: http://www3.dsi.uminho.pt/pcortez/dss09.bib
Github - https://github.com/
Miscellaneous:
The Australian Wine Research Institute
Waterhouse Lab, UCDavis, University of California
MoreThanOrganic, French Natural Wine
Quench
ResearchGate
Miller, Mike. How SO2 & pH are Linked